Initialization

Loading the dataset and looking at its structure and variables

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
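
The two listings above look like the output of `summary()` and `str()` on the loaded data frame. A minimal sketch of that step follows. The file name `wines.csv` and the helper `load_wines()` are assumptions made for illustration, since the report does not show how the data was read.

```r
# Hypothetical loader: the file name is an assumption, not taken from the report.
load_wines <- function(path = "wines.csv") {
  wines <- read.csv(path)
  # The analysis relies on the quality column, so fail early if it is missing.
  stopifnot("quality" %in% names(wines))
  wines
}

# Only run the inspection when the (assumed) file is actually present.
if (file.exists("wines.csv")) {
  wines <- load_wines("wines.csv")
  print(summary(wines))  # per-column quartiles, mean, min and max
  str(wines)             # column types and the first few values
}
```

The `X` column in the listings appears to be a row index (1 to 4898), so it is probably best excluded from any later correlation or modelling step.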

Initial analysis of the data

Quality of wines

The wine qualities are distributed around the median of 6. The tail is slightly heavier on the lower-quality side, with quality 5 being by far the second most common score after 6. No wines were given a 10 or a score of 0-2, and only 5 wines were of quality 9. The vast majority of wines have a quality of either 5 or 6.
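
The counts behind a statement like this can be tabulated directly. The sketch below is one way to do it, assuming a 0-10 scale as implied above. The helper `quality_counts()` and the toy scores are made up for illustration and are not the report's data.

```r
# Count wines per score on the full 0-10 scale so unused scores show up as zeros.
quality_counts <- function(quality) {
  table(factor(quality, levels = 0:10))
}

# Toy scores for illustration only (not the report's data).
toy_quality <- c(3, 5, 5, 5, 6, 6, 6, 6, 7, 8)
counts <- quality_counts(toy_quality)
print(counts)
print(round(prop.table(counts), 2))  # share of wines at each score
```

Fixing the factor levels keeps empty scores such as 0-2 and 10 visible in the table instead of silently dropping them.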

After running ggpairs to view the relationships between features, the plots and correlation values indicate only weak correlations between the independent variables and wine quality. This is understandable, as it is difficult to imagine linear relationships between wine quality and, for example, the salt, sugar, or alcohol content of the wine, or its acidity.
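
As a compact alternative to reading the correlations off a pairs plot, the correlation of each feature with quality can be listed directly. This is a sketch using base R `cor()` rather than ggpairs. The helper `cor_with_quality()` and the toy data frame are hypothetical.

```r
# Pearson correlation of every numeric column with quality, strongest first.
cor_with_quality <- function(data, target = "quality") {
  features <- setdiff(names(data)[sapply(data, is.numeric)], target)
  r <- sapply(features, function(f) cor(data[[f]], data[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Toy data for illustration only (not the report's data).
toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.0, 11.0, 12.0, 12.5),
  density = c(0.998, 0.996, 0.997, 0.993, 0.992, 0.991),
  quality = c(5, 5, 6, 6, 7, 8)
)
print(round(cor_with_quality(toy), 2))
```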

The ggpairs output is not shown here as it renders poorly in the knitted HTML. To get a clearer view of the variables' relationships with wine quality, I will plot them as scatterplots:

Plot each independent variable against wine quality, with lines for the mean and median. Outliers in the independent variables are omitted, alpha is used to give a clearer view of the spread of each variable, and jitter is added to reduce overplotting.

## Warning: Removed 75 rows containing missing values (stat_summary).
## Warning: Removed 75 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (geom_point).

## Warning: Removed 82 rows containing missing values (stat_summary).
## Warning: Removed 82 rows containing missing values (stat_summary).
## Warning: Removed 114 rows containing missing values (geom_point).

## Warning: Removed 68 rows containing missing values (stat_summary).
## Warning: Removed 68 rows containing missing values (stat_summary).
## Warning: Removed 94 rows containing missing values (geom_point).

## Warning: Removed 81 rows containing missing values (stat_summary).
## Warning: Removed 81 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (geom_point).

## Warning: Removed 87 rows containing missing values (stat_summary).
## Warning: Removed 87 rows containing missing values (stat_summary).
## Warning: Removed 405 rows containing missing values (geom_point).

## Warning: Removed 90 rows containing missing values (stat_summary).
## Warning: Removed 90 rows containing missing values (stat_summary).
## Warning: Removed 106 rows containing missing values (geom_point).

## Warning: Removed 98 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (geom_point).

## Warning: Removed 98 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (stat_summary).
## Warning: Removed 3538 rows containing missing values (geom_point).

## Warning: Removed 85 rows containing missing values (stat_summary).
## Warning: Removed 85 rows containing missing values (stat_summary).
## Warning: Removed 95 rows containing missing values (geom_point).

## Warning: Removed 84 rows containing missing values (stat_summary).
## Warning: Removed 84 rows containing missing values (stat_summary).
## Warning: Removed 105 rows containing missing values (geom_point).

## Warning: Removed 78 rows containing missing values (stat_summary).
## Warning: Removed 78 rows containing missing values (stat_summary).
## Warning: Removed 136 rows containing missing values (geom_point).
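
The plotting code is not shown in the report. A sketch of one such plot, under stated assumptions, might look like the following. The helper `plot_vs_quality()`, the 1%/99% trimming quantiles, the colours and the toy data are all assumptions. Using scale limits to drop outliers is a guess that is at least consistent with the "Removed N rows" warnings above.

```r
library(ggplot2)

# Scatter one feature against quality, with mean (red) and median (blue) lines.
# `trim` is a hypothetical cut-off: the report does not say which one it used.
plot_vs_quality <- function(data, feature, trim = c(0.01, 0.99)) {
  limits <- quantile(data[[feature]], probs = trim, na.rm = TRUE, names = FALSE)
  ggplot(data, aes(x = quality, y = .data[[feature]])) +
    # Horizontal jitter and transparency spread out the integer quality scores.
    geom_jitter(alpha = 0.2, width = 0.3, height = 0) +
    stat_summary(fun = mean, geom = "line", colour = "red") +
    stat_summary(fun = median, geom = "line", colour = "blue") +
    # Scale limits drop rows outside the range before the summaries are computed.
    scale_y_continuous(limits = limits) +
    labs(x = "quality", y = feature)
}

# Toy data for illustration only (not the report's data).
set.seed(1)
toy <- data.frame(quality = sample(4:8, 200, replace = TRUE))
toy$alcohol <- 8 + 0.4 * toy$quality + rnorm(200)
p <- plot_vs_quality(toy, "alcohol")
```

Because the scale limits remove rows before `stat_summary()` runs, the mean line is computed on trimmed data, which pulls it toward the median.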

Looking at the plots it is evident that the data is very dense at qualities of 5 and 6, and significantly less dense at other qualities.

The mean and median of each feature seem quite similar (the largest outliers have been omitted from the plots, which will exaggerate this). Additionally, the correlations of all features with wine quality seem very small or nonexistent. The mean and median values change very little between quality levels, with the exception of alcohol, which seems to decrease at first for low-quality wines and then increase roughly linearly as quality increases.

Overall, the variance in the variables seems very high in most cases. Chlorides seems to have the lowest variance for good quality wines, but funnily enough the chloride level seems very similar between good and poor wines!

To get a better estimate of the variance in the independent variables, let's create boxplots of the relationships between the independent variables and wine quality.

Sulphates, citric.acid and pH do not seem to affect wine quality at all, since their mean and median are nearly constant and their variance is very high at every quality level. Chlorides and free.sulfur.dioxide also seem to differ very little between wine qualities. Therefore these variables will not be studied further in the boxplots.

Box plots of wine quality and alcohol, density, chlorides, fixed.acidity, residual.sugar, and volatile.acidity

The boxplots further identify the problem: the features simply do not seem to be enough to explain the quality of good wines. Most of the features have high variance in wines of good quality, and they show very low, nonlinear correlation with quality.

The features that seem to correlate the most with wine quality are:
- alcohol, which starts a bit higher for poor wines, decreases until a quality of 5, and afterwards increases somewhat linearly.
- density, where better quality wines seem less dense.
- volatile.acidity, where better quality wines seem to be slightly less acidic. Wines of quality 3 are an outlier here, as they do not follow the same curve as the rest of the qualities, but this may very well be because there are only 20 wines of quality 3.
- residual.sugar, which also showed some correlation with wine quality. The relationship, however, looks very non-monotonic: average wines seemed sweeter than both poor and good wines.

Let's look at the first three of these relationships (alcohol, density, and volatile.acidity) further.

Plots and summary

To get an idea of the features, let's first look at some descriptive statistics of alcohol, density, and volatile.acidity when they are grouped by wine quality. Wines of quality 9 and 3 are omitted as there are very few wines rated as such.

## Source: local data frame [5 x 4]
## 
##   quality mean_alcohol median_alcohol variance_alcohol
## 1       4     10.15245           10.1        1.0064446
## 2       5      9.80884            9.5        0.7175196
## 3       6     10.57537           10.5        1.3173902
## 4       7     11.36794           11.4        1.5538515
## 5       8     11.63600           12.0        1.6387540
## Source: local data frame [5 x 4]
## 
##   quality mean_density median_density variance_density
## 1       4    0.9942767        0.99410     6.063201e-06
## 2       5    0.9952626        0.99530     6.475670e-06
## 3       6    0.9939613        0.99366     9.141611e-06
## 4       7    0.9924524        0.99176     7.659998e-06
## 5       8    0.9922359        0.99164     7.771407e-06
## Source: local data frame [5 x 4]
## 
##   quality mean_volatile.acidity median_volatile.acidity
## 1       4             0.3812270                    0.32
## 2       5             0.3020110                    0.28
## 3       6             0.2605641                    0.25
## 4       7             0.2627670                    0.25
## 5       8             0.2774000                    0.26
## Variables not shown: variance_volatile.acidity (dbl)
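
The "Source: local data frame" header suggests the tables above came from dplyr. A minimal sketch of such a grouped summary follows. The helper `summarise_by_quality()`, its generic column names and the toy data are assumptions for illustration.

```r
library(dplyr)

# Mean, median and variance of one feature per quality level (4 to 8 only).
summarise_by_quality <- function(data, feature, keep = 4:8) {
  data %>%
    filter(quality %in% keep) %>%
    group_by(quality) %>%
    summarise(
      mean = mean(.data[[feature]]),
      median = median(.data[[feature]]),
      variance = var(.data[[feature]])
    )
}

# Toy data for illustration only (not the report's data).
toy <- data.frame(
  quality = c(3, 4, 4, 5, 5, 9),
  alcohol = c(11, 10, 12, 9, 10, 13)
)
res <- summarise_by_quality(toy, "alcohol")
print(res)
```

Filtering before grouping keeps the sparsely populated scores (3 and 9) out of the table, as described above.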

Looking at the mean and median of alcohol, they dip from quality 4 to quality 5 and then increase with quality at a slowing rate. The variance of alcohol is high at all quality levels, though.

Looking at the mean and median of density, the differences between quality levels seem very small. The variance is also very low.

The mean and median of volatile.acidity seem to decrease at a slowing rate, levelling off at around 0.26. Similar to density, the variance seems quite low compared to other variables (although still not quite as low as for density).

Having this information, let's look more closely at the boxplots of these 3 features. The boxplots are zoomed so that we can focus on the relationships around the mean and the middle quantiles.

Alcohol vs quality

Density vs quality

Volatile.acidity vs quality
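
One way to draw zoomed boxplots like these is sketched below. The helper `zoomed_boxplot()`, the 5%/95% view window and the toy data are assumptions, since the report's plotting code is not shown.

```r
library(ggplot2)

# Boxplot of a feature per quality level, zoomed to the central quantiles.
# coord_cartesian() only changes the view; the boxes still use all rows.
zoomed_boxplot <- function(data, feature, probs = c(0.05, 0.95)) {
  view <- quantile(data[[feature]], probs = probs, na.rm = TRUE, names = FALSE)
  ggplot(data, aes(x = factor(quality), y = .data[[feature]])) +
    geom_boxplot() +
    # Mark the group mean so it can be compared with the median line.
    stat_summary(fun = mean, geom = "point", shape = 4, colour = "red") +
    coord_cartesian(ylim = view) +
    labs(x = "quality", y = feature)
}

# Toy data for illustration only (not the report's data).
set.seed(2)
toy <- data.frame(quality = sample(4:8, 200, replace = TRUE))
toy$density <- 0.999 - 0.001 * toy$quality + rnorm(200, sd = 0.002)
p <- zoomed_boxplot(toy, "density")
```

Zooming with `coord_cartesian()` rather than with scale limits matters here: scale limits would discard rows and change the quartiles that the boxes display.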

The boxplots add detail to the correlation values found earlier. The relationships of density and alcohol with wine quality shown in the boxplots seem slightly nonlinear. For relationships such as these, the Pearson correlations calculated in the ggpairs plots earlier may underestimate the actual strength of the association.

Let's compare the Pearson correlations with Spearman correlations:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233
## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$volatile.acidity
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2215214 -0.1676307
## sample estimates:
##       cor 
## -0.194723
## Warning in cor.test.default(wines$quality, wines$alcohol, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  wines$quality and wines$alcohol
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4403692
## Warning in cor.test.default(wines$quality, wines$density, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  wines$quality and wines$density
## S = 2.6406e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.348351
## Warning in cor.test.default(wines$quality, wines$volatile.acidity, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  wines$quality and wines$volatile.acidity
## S = 2.3434e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1965617
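
The tests above are the output of `cor.test()`. The sketch below shows the two methods side by side on a toy example. The helper `compare_correlations()` and the toy vectors are made up for illustration. Passing `exact = FALSE` is one way to avoid the "Cannot compute exact p-value with ties" warning.

```r
# Pearson and Spearman estimates side by side for one pair of variables.
compare_correlations <- function(x, y) {
  pearson <- cor.test(x, y, method = "pearson")
  # exact = FALSE skips the exact p-value, which cannot be computed with ties.
  spearman <- cor.test(x, y, method = "spearman", exact = FALSE)
  c(pearson = unname(pearson$estimate), spearman = unname(spearman$estimate))
}

# Toy example: a monotonic but nonlinear relationship.
x <- 1:10
y <- x^3
res <- compare_correlations(x, y)
print(round(res, 3))
```

On a perfectly monotonic relationship the Spearman estimate is 1 while the Pearson estimate stays below 1, which is the kind of gap the comparison is meant to reveal.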

The correlation between density and wine quality is now slightly stronger. Alcohol and volatile.acidity changed very little, which is to be expected from the boxplots.

As seen in the boxplots, the relationships with quality, especially that of density, seem slightly non-monotonic. Spearman correlation measures only monotonic association, so it may understate these relationships as well, and its values should be treated with caution.

Let's try to build a linear model for wine quality using the independent variables with the largest perceived correlation.

# Treat the integer quality score as a numeric response.
wines$quality <- as.numeric(wines$quality)
# Start from a single predictor and add one term at a time.
m1 <- lm(I(quality) ~ I(volatile.acidity), data = wines)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + alcohol)

Since residual.sugar and sulphates showed some correlation, let's try adding them to the model as well:

m4 <- update(m3, ~ . + residual.sugar)
m5 <- update(m4, ~ . + sulphates)
library(memisc)  # provides mtable()
mtable(m1,m2,m3,m4,m5)
## 
## Calls:
## m1: lm(formula = I(quality) ~ I(volatile.acidity), data = wines)
## m2: lm(formula = I(quality) ~ I(volatile.acidity) + density, data = wines)
## m3: lm(formula = I(quality) ~ I(volatile.acidity) + density + alcohol, 
##     data = wines)
## m4: lm(formula = I(quality) ~ I(volatile.acidity) + density + alcohol + 
##     residual.sugar, data = wines)
## m5: lm(formula = I(quality) ~ I(volatile.acidity) + density + alcohol + 
##     residual.sugar + sulphates, data = wines)
## 
## ===========================================================================
##                          m1         m2         m3         m4         m5    
## ---------------------------------------------------------------------------
## (Intercept)           6.354***   95.245*** -36.499***  74.225***  96.322***
##                      (0.036)     (3.927)    (6.001)   (11.977)   (12.376)  
## I(volatile.acidity)  -1.711***   -1.639***  -2.072***  -2.059***  -2.022***
##                      (0.123)     (0.117)    (0.110)    (0.109)    (0.109)  
## density                         -89.445***  38.992*** -71.546*** -93.896***
##                                  (3.951)    (5.920)   (11.923)   (12.335)  
## alcohol                                      0.399***   0.286***   0.261***
##                                             (0.014)    (0.018)    (0.018)  
## residual.sugar                                          0.052***   0.061***
##                                                        (0.005)    (0.005)  
## sulphates                                                          0.657***
##                                                                   (0.099)  
## ---------------------------------------------------------------------------
## R-squared                0.038      0.129      0.247      0.264      0.271 
## adj. R-squared           0.038      0.129      0.246      0.263      0.270 
## sigma                    0.869      0.827      0.769      0.760      0.757 
## F                      192.958    362.791    534.843    438.646    362.919 
## p                        0.000      0.000      0.000      0.000      0.000 
## Log-likelihood       -6259.952  -6016.114  -5660.164  -5604.126  -5581.981 
## Deviance              3695.351   3345.142   2892.625   2827.187   2801.737 
## AIC                  12525.903  12040.228  11330.329  11220.251  11177.961 
## BIC                  12545.393  12066.214  11362.812  11259.231  11223.437 
## N                     4898       4898       4898       4898       4898     
## ===========================================================================

The linear model explains wine quality poorly: even the largest model, m5, explains only 27.1% of the variance in wine quality.

Let's see what went wrong:

summary(m5)
## 
## Call:
## lm(formula = I(quality) ~ I(volatile.acidity) + density + alcohol + 
##     residual.sugar + sulphates, data = wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3051 -0.4964 -0.0384  0.4616  3.1776 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          96.32177   12.37637   7.783 8.60e-15 ***
## I(volatile.acidity)  -2.02151    0.10859 -18.617  < 2e-16 ***
## density             -93.89603   12.33453  -7.612 3.21e-14 ***
## alcohol               0.26084    0.01808  14.428  < 2e-16 ***
## residual.sugar        0.06091    0.00506  12.037  < 2e-16 ***
## sulphates             0.65729    0.09860   6.666 2.92e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7568 on 4892 degrees of freedom
## Multiple R-squared:  0.2706, Adjusted R-squared:  0.2698 
## F-statistic: 362.9 on 5 and 4892 DF,  p-value: < 2.2e-16
anova(m5)
## Analysis of Variance Table
## 
## Response: I(quality)
##                       Df  Sum Sq Mean Sq F value    Pr(>F)    
## I(volatile.acidity)    1  145.64  145.64 254.294 < 2.2e-16 ***
## density                1  350.21  350.21 611.485 < 2.2e-16 ***
## alcohol                1  452.52  452.52 790.122 < 2.2e-16 ***
## residual.sugar         1   65.44   65.44 114.259 < 2.2e-16 ***
## sulphates              1   25.45   25.45  44.436 2.918e-11 ***
## Residuals           4892 2801.74    0.57                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
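
The R-squared of m5 can be recovered from the sums of squares in the analysis of variance table above, which makes the "share of variance explained" reading concrete. The variable names in this sketch are my own. The numbers are copied from the table.

```r
# Sums of squares copied from the anova(m5) table above.
ss_terms <- c(volatile.acidity = 145.64, density = 350.21, alcohol = 452.52,
              residual.sugar = 65.44, sulphates = 25.45)
ss_residual <- 2801.74

# R-squared is the share of the total sum of squares not left in the residuals.
ss_total <- sum(ss_terms) + ss_residual
r_squared <- 1 - ss_residual / ss_total
print(round(r_squared, 4))

# Share of the total variation attributed to each term, in the order added.
print(round(ss_terms / ss_total, 3))
```

This reproduces the Multiple R-squared of 0.2706 reported by `summary(m5)`. Note that `anova()` on a single `lm` fit is sequential, so the per-term shares depend on the order in which the terms were added.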

The standard error for density is much higher than for the other predictors, although this partly reflects how narrow the range of density values is. Its Pr(>|t|) value is still very low, which leads me to believe with high confidence that all 5 independent variables are associated with wine quality. The F-values and p-values of the independent variables all suggest that the variables improve the regression model over the intercept-only model.

Although the independent variables used in the linear model clearly improve it, the model still seems ill-suited to explaining wine quality, as seen from the low R-squared value.
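
Raw coefficients and standard errors depend on each variable's units, so they are hard to compare directly. As a rough illustration, the sketch below rescales the density and alcohol terms of m5 to a move from the first to the third quartile. The interquartile-range rescaling is my own choice rather than something the report does. The numbers are copied from the outputs above.

```r
# Estimates and standard errors copied from summary(m5).
estimate  <- c(density = -93.89603, alcohol = 0.26084)
std_error <- c(density = 12.33453, alcohol = 0.01808)

# Interquartile ranges taken from the summary() output at the top of the report.
iqr <- c(density = 0.9961 - 0.9917, alcohol = 11.40 - 9.50)

# Change in predicted quality when a feature moves from its 1st to 3rd quartile.
effect_per_iqr <- estimate * iqr
se_per_iqr <- std_error * iqr
print(round(rbind(effect_per_iqr, se_per_iqr), 3))
```

On this scale the two terms have effects of similar size (roughly 0.4 to 0.5 quality points in opposite directions) and comparable standard errors, so the large raw standard error for density is mostly a matter of units.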

Reflection

The features provided seem ill suited to explain wine quality. This is quite understandable as the quality of wine should logically not have a linear positive correlation with features such as acidity, sweetness (sugar) or alcohol. There are numerous good wines that can be either sweet or not-so-sweet. This also explains the high variance of these features in wine qualities and the fact that some features could be quite similar in both low quality and high quality wines. It all comes down to the combination of different flavors and of course, personal preference.

The dataset was also quite poor, as almost all of the wines in the dataset had qualities of 5-7. It would have been interesting if the dataset had contained more data on very high and very low quality wines.